{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 基础"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## xpath简介"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**XPath 是一门在 XML 文档中查找信息的语言。XPath 可用来在 XML 文档中对元素和属性进行遍历。**\n",
    "\n",
    "**XPath 是 W3C XSLT 标准的主要元素，并且 XQuery 和 XPointer 都构建于 XPath 表达之上。**\n",
    "\n",
    "**因此，对 XPath 的理解是很多高级 XML 应用的基础。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 推荐工具\n",
    "\n",
    "浏览器插件：XPath Helper\n",
    "\n",
    "阅读xml软件：SketchPath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## xpath的一般用法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|    表达式    |                            描述                            |\n",
    "| :------: | :--------------------------------------------------------: |\n",
    "| nodename |                  选取此节点的所有子节点。                  |\n",
    "|    /     |                       从根节点选取。                       |\n",
    "|    //    | 从匹配选择的当前节点选择文档中的节点，而不考虑它们的位置。 |\n",
    "|    .     |                       选取当前节点。                       |\n",
    "|    ..    |                   选取当前节点的父节点。                   |\n",
    "|    @     |                         选取属性。                         |\n",
    "|    *     |                         通配符                         |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**实例**\n",
    "\n",
    "在下面的表格中，列出了一些路径表达式以及表达式的结果："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|   路径表达式    |                             结果                             |\n",
    "| :-------------: | :----------------------------------------------------------: |\n",
    "|    bookstore    |              选取 bookstore 元素的所有子节点。               |\n",
    "|   /bookstore    | 选取根元素 bookstore。注释：假如路径起始于正斜杠( / )，则此路径始终代表到某元素的绝对路径！ |\n",
    "| bookstore/book  |        选取属于 bookstore 的子元素的所有 book 元素。         |\n",
    "|     //book      |       选取所有 book 子元素，而不管它们在文档中的位置。       |\n",
    "| bookstore//book | 选择属于 bookstore 元素的后代的所有 book 元素，而不管它们位于 bookstore 之下的什么位置。 |\n",
    "|     //@lang     |                  选取名为 lang 的所有属性。                  |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **注意：**xpath第一个元素从1开始，python则从0开始"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 选取节点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先载入 lxml 库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "随便举一个 xml 例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "xml = '''\n",
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n",
    "\n",
    "<bookstore>\n",
    "\n",
    "<book>\n",
    "  <title lang=\"eng\">Harry Potter</title>\n",
    "  <price>29.99</price>\n",
    "</book>\n",
    "\n",
    "<book>\n",
    "  <title lang=\"eng\">Learning XML</title>\n",
    "  <price>39.95</price>\n",
    "</book>\n",
    "\n",
    "<book>\n",
    "  <title lang=\"chinese\">Tom</title>\n",
    "  <price>10</price>\n",
    "</book>\n",
    "</bookstore>\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "selector = etree.HTML(xml)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个 book 元素: \n",
      "['\\n  ', 'Harry Potter', '\\n  ', '29.99', '\\n']\n",
      "\n",
      "第2个 book 元素: \n",
      "['\\n  ', 'Learning XML', '\\n  ', '39.95', '\\n']\n",
      "\n",
      "最后一个 book 元素: \n",
      "['\\n  ', 'Tom', '\\n  ', '10', '\\n']\n",
      "\n",
      "倒数第2个 book 元素: \n",
      "['\\n  ', 'Learning XML', '\\n  ', '39.95', '\\n']\n",
      "\n",
      "前2个 book 元素: \n",
      "['\\n  ', 'Harry Potter', '\\n  ', '29.99', '\\n', '\\n  ', 'Learning XML', '\\n  ', '39.95', '\\n']\n",
      "\n",
      "选取所有拥有名为 lang 的属性的 title 元素: \n",
      "['Harry Potter', 'Learning XML', 'Tom']\n",
      "\n",
      "选取所有 title 元素，且这些元素拥有值为 chinese 的 lang 属性: \n",
      "['Tom']\n",
      "\n",
      "选取所有的title与price节点: \n",
      "['Harry Potter', '29.99', 'Learning XML', '39.95', 'Tom', '10']\n"
     ]
    }
   ],
   "source": [
    "print('第1个 book 元素: ')\n",
    "print(selector.xpath('//book[1]//text()'))\n",
    "\n",
    "print('\\n第2个 book 元素: ')\n",
    "print(selector.xpath('//book[2]//text()'))\n",
    "\n",
    "print('\\n最后一个 book 元素: ')\n",
    "print(selector.xpath('//book[last()]//text()'))\n",
    "\n",
    "print('\\n倒数第2个 book 元素: ')\n",
    "print(selector.xpath('//book[last()-1]//text()'))\n",
    "\n",
    "print('\\n前2个 book 元素: ')\n",
    "print(selector.xpath('//book[position() < 3]//text()'))\n",
    "\n",
    "print('\\n选取所有拥有名为 lang 的属性的 title 元素: ')\n",
    "print(selector.xpath('//title[@lang]//text()'))\n",
    "\n",
    "print('\\n选取所有 title 元素，且这些元素拥有值为 chinese 的 lang 属性: ')\n",
    "print(selector.xpath('//title[@lang=\"chinese\"]//text()'))\n",
    "\n",
    "print('\\n选取所有的title与price节点: ')\n",
    "print(selector.xpath('//title//text() | //price//text()'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 位置路径表达式\n",
    "\n",
    "位置路径可以是绝对的，也可以是相对的。\n",
    "\n",
    "绝对路径起始于正斜杠( / )，而相对路径不会这样。在两种情况中，位置路径均包括一个或多个步，每个步均被斜杠分割：\n",
    "\n",
    "绝对位置路径：`/step/step/...`\n",
    "\n",
    "相对位置路径：`step/step/...`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Xpath轴\n",
    "轴可以定义相对于当前节点的节点集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element html at 0x2ca0097e988>,\n",
       " <Element body at 0x2ca00878148>,\n",
       " <Element bookstore at 0x2ca0097ee88>,\n",
       " <Element book at 0x2ca0097e608>,\n",
       " <Element book at 0x2ca0097e788>,\n",
       " <Element book at 0x2ca0097e488>]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title/ancestor::*')   # 选取title节点的所有先辈节点（父、祖父）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element html at 0x2ca0097e988>,\n",
       " <Element body at 0x2ca00878148>,\n",
       " <Element bookstore at 0x2ca0097ee88>,\n",
       " <Element book at 0x2ca0097e608>,\n",
       " <Element title at 0x2ca00987f48>,\n",
       " <Element book at 0x2ca0097e788>,\n",
       " <Element title at 0x2ca00987fc8>,\n",
       " <Element book at 0x2ca0097e488>,\n",
       " <Element title at 0x2ca00995048>]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title/ancestor-or-self::*')  # 选取 title节点的所有先辈（父、祖父等）以及当前节点本身。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['eng', 'eng', 'chinese']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title/attribute::*')         # 选取 title节点的所有属性。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|         轴名称         |                           结果                           |\n",
    "| :----------------: | :------------------------------------------------------: |\n",
    "|      ancestor      |          选取当前节点的所有先辈（父、祖父等）。          |\n",
    "|  ancestor-or-self  |  选取当前节点的所有先辈（父、祖父等）以及当前节点本身。  |\n",
    "|     attribute      |                 选取当前节点的所有属性。                 |\n",
    "|       child        |                选取当前节点的所有子元素。                |\n",
    "|     descendant     |         选取当前节点的所有后代元素（子、孙等）。         |\n",
    "| descendant-or-self | 选取当前节点的所有后代元素（子、孙等）以及当前节点本身。 |\n",
    "|     following      |       选取文档中当前节点的结束标签之后的所有节点。       |\n",
    "|     namespace      |             选取当前节点的所有命名空间节点。             |\n",
    "|       parent       |                  选取当前节点的父节点。                  |\n",
    "|     preceding      |       选取文档中当前节点的开始标签之前的所有节点。       |\n",
    "| preceding-sibling  |             选取当前节点之前的所有同级节点。             |\n",
    "|        self        |                      选取当前节点。                      |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 功能函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "starts-with函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Tom']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title[starts-with(@lang,\"ch\")]//text()')    # 选取lang值以ch开头的 title字节"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "contains函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Harry Potter', 'Learning XML']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title[contains(@lang,\"en\")]//text()')       # 选取lang值包含en的 title字节"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and用法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Harry Potter', 'Learning XML']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 选取lang值包含 en和 g的 title字节\n",
    "selector.xpath('//title[contains(@lang,\"en\") and contains(@lang,\"g\")]//text()') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "文本中部分包含用法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Learning XML']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//title[contains(text(),\"L\")]//text()')       # 选取节点文本中包含 L 的 title节点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "string用法：获取文本，返回字符串格式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "string函数： \n",
      "  Harry Potter\n",
      "  29.99\n",
      "\n",
      "text函数： ['\\n  ', 'Harry Potter', '\\n  ', '29.99', '\\n']\n"
     ]
    }
   ],
   "source": [
    "info = selector.xpath('//title/ancestor::*')\n",
    "strings = info[3].xpath('string(.)')\n",
    "print('string函数：', strings)\n",
    "\n",
    "texts = info[3].xpath('.//text()')\n",
    "print('text函数：', texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 常见问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XPath『不包含』应该怎么写？\n",
    "假设有这样一段HTML代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "html = '''\n",
    "<html>\n",
    "    <head>\n",
    "        <title>测试XPath移除功能</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <div class=\"post\">\n",
    "            <div class=\"quote\">无关紧要的引用内容</div>\n",
    "                你好啊\n",
    "                <strong>产品经理</strong>，\n",
    "                <span>很高兴认识你</span>\n",
    "                。\n",
    "        </div>\n",
    "    </body>\n",
    "</html>\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我想把其中的 **你好啊产品经理，很高兴认识你** 提取出来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['产品经理', '很高兴认识你']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from lxml import etree\n",
    "selector = etree.fromstring(html)\n",
    "selector.xpath('//div[@class=\"post\"]//*[not(@class=\"quote\")]/text()')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但是这里缺少 **你好啊** ，因为它不属于任何子标签。\n",
    "\n",
    "为了单独直接获取`div`下面的内容，我们需要使用 `|`再拼接一个 XPath："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'你好啊产品经理，很高兴认识你。'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = selector.xpath('//div[@class=\"post\"]/text() | //div[@class=\"post\"]//*[not(@class=\"quote\")]/text()')\n",
    "text = ''.join(map(lambda x: x.strip() , data))\n",
    "text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 标签套标签,如何提取成一句完整的话？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "html = '''\n",
    "<div id=\"class3\">\n",
    "    我左青龙,\n",
    "    <span id='tiger'>\n",
    "        右白虎,\n",
    "        <ul>上朱雀,\n",
    "            <li>下玄武.</li>\n",
    "        </ul>\n",
    "        老牛在当中,\n",
    "    </span>\n",
    "    龙头在胸口.\n",
    "</div>\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "selector = etree.HTML(html)\n",
    "data = selector.xpath('//div[@id=\"class3\"]')[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "方法一：使用string函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    我左青龙,\n",
      "    \n",
      "        右白虎,\n",
      "        上朱雀,\n",
      "            下玄武.\n",
      "        \n",
      "        老牛在当中,\n",
      "    \n",
      "    龙头在胸口.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "info = data.xpath('string(.)')   # 实际上是去除了div中间的其他多余标签\n",
    "print(info)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我左青龙,右白虎,上朱雀,下玄武.老牛在当中,龙头在胸口.\n"
     ]
    }
   ],
   "source": [
    "content2 = info.replace('\\n','').replace(' ','')   # 将换行与空格分别取代\n",
    "print(content2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "方法二：使用text函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n    我左青龙,\\n    ',\n",
       " '\\n        右白虎,\\n        ',\n",
       " '上朱雀,\\n            ',\n",
       " '下玄武.',\n",
       " '\\n        ',\n",
       " '\\n        老牛在当中,\\n    ',\n",
       " '\\n    龙头在胸口.\\n']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "info = data.xpath('.//text()')\n",
    "info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我左青龙,右白虎,上朱雀,下玄武.老牛在当中,龙头在胸口.\n"
     ]
    }
   ],
   "source": [
    "content2 = ''.join(map(lambda x: x.strip() , info))\n",
    "print(content2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "toc-autonumbering": true,
  "toc-showcode": false,
  "toc-showmarkdowntxt": false,
  "toc-showtags": false
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
